Side cache array for greater fetch bandwidth

ABSTRACT

In one embodiment, a microprocessor, comprising: an instruction cache configured to receive an instruction fetch comprising a first byte portion and a second byte portion; a side cache tag array configured to signal further processing of the second byte portion in addition to the first byte portion based on a hit of the side cache tag array; and a side cache data array configured to store instruction data for the second byte portion.

TECHNICAL FIELD

The present invention relates in general to microprocessors, and in particular, to instruction fetch bandwidth in microprocessors.

BACKGROUND

Microprocessors include one or more execution units that perform the actual execution of instructions. Superscalar processors include the ability to issue multiple instructions per clock cycle to the various execution units to improve the throughput, or average instructions per clock cycle, of the processor. The instruction fetch and decoding functions at the top of the microprocessor pipeline should provide an instruction stream to the execution units at a sufficient rate to utilize the additional execution units and actually improve the throughput.

The x86 architecture makes this task more difficult because the instructions of the instruction set are not fixed length; rather, the length of each instruction may vary. Thus, an x86 microprocessor needs to include an extensive amount of logic to process the incoming stream of instruction bytes to determine where each instruction starts and ends. Today's microprocessors typically fetch sixteen (16) bytes of data per cycle, since fetch lengths greater than sixteen impose considerable timing constraints in instruction formatting, such as determining instruction boundaries and prefix information, particularly as clock speeds rise. Further, the need for fetches beyond 16 bytes/cycle has traditionally not been a common requirement. However, the increasing popularity of multimedia in many types of digital devices has led to a concomitant, seemingly annual, increase in multimedia instructions, and thus some chip manufacturers have used different approaches to handling fetches beyond 16 bytes (e.g., 32 byte fetches). Unfortunately, solutions have generally resulted in the need for wholesale recovery mechanisms based on errors when encountering self-modifying code or some alias cases, or large and enormously complicated caches with lower-than-expected performance. Thus there is a need to handle fetches beyond 16 bytes without sacrificing performance.

SUMMARY

In one embodiment, a microprocessor, comprising: an instruction cache configured to receive an instruction fetch comprising a first byte portion and a second byte portion; a side cache tag array configured to signal further processing of the second byte portion in addition to the first byte portion based on a hit of the side cache tag array; and a side cache data array configured to store instruction data for the second byte portion.

Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of the invention can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, with emphasis instead being placed upon clearly illustrating the principles of the present invention. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views.

FIG. 1A is a block diagram showing an embodiment of an example side cache array system used in a microprocessor pipeline.

FIG. 1B is a schematic diagram showing an example front end of the microprocessor pipeline shown in FIG. 1A.

FIG. 2 is a schematic diagram that shows an embodiment of example expansion logic used in a side cache array system.

FIG. 3 is a flow diagram that shows an embodiment of an example side cache array method.

DETAILED DESCRIPTION

Certain embodiments of a side cache array system and method are disclosed that enable the efficient processing by a microprocessor of groups of instructions totalling more than sixteen (16) bytes in length, such as those found in multimedia code. In one embodiment, a side cache is implemented that only stores instruction information or data (e.g., instruction boundaries, prefix information, etc.) for the second byte portion (e.g., the second half) of a thirty-two (32) byte fetch, while allowing the regular logic to process the first portion (e.g., first half) of the 32 byte fetch. The tag and data arrays of the side cache reside in different pipe stages, where the side cache tag array is read early. A hit in the side cache tag array results in an increment of 32 bytes in the sequential fetch address of the instruction cache (I-cache) and staging of that data down to an XIB queue. Later, that hit to the side cache tag array also results in the instruction information being written into the XIB queue with the calculated first byte portion. Through the use of the side cache, fetches of 32 bytes can be handled without the errors or large cache sizes found in other methods used to handle 32 byte fetches. Generally, certain embodiments of the side cache array system provide for better throughput in the presence of long instructions (e.g., AVX-type instructions, which can be 6 to 11 bytes long) that often result in a 4-instruction group exceeding 16 bytes.

Digressing briefly, though other mechanisms have been established for handling 32 byte fetches, there are shortcomings to those approaches. For instance, one method performs a slow scan of 16 bytes per fetch and then accumulates the instructions and instruction boundaries determined from those scans in the same cache. However, such a method is vulnerable to self-modifying code or alias cases that render the start and end marks in error, requiring a slow and potentially error-prone recovery process. In some methods, a micro-op cache is created to enable more throughput. For instance, the micro-op cache serves as an independent, front end replacement (e.g., of the I-cache) with higher bandwidth (e.g., maximum of 4 micro-ops/cycle, or 6 micro-ops/cycle). However, the cache is very large and complex, and incorporating such a solution effectively requires a re-design of most of the pipeline for many microprocessors. In contrast, certain embodiments of a side cache array system address the need for fetches of greater than 16 bytes by widening the I-cache fetch and using the side cache array to store the start/end/prefix information for the second portion of the 32 byte fetch while allowing the regular (e.g., L stage and M stage) logic to process the first portion, providing a simple approach using space saving techniques while enabling greater throughput (e.g., enables issuance of four x86 instructions/cycle for critical loops, even for instructions having an average length of eight (8) bytes).

Having summarized certain features of a side cache array system of the present disclosure, reference will now be made in detail to the description of a side cache array system as illustrated in the drawings. While a side cache array system will be described in connection with these drawings, there is no intent to limit it to the embodiment or embodiments disclosed herein. That is, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail sufficient for an understanding of persons skilled in the art. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed. On the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.

Various units, modules, circuits, logic, or other components may be described as “configured to” perform a task or tasks. In such contexts, “configured to” is a broad recitation of structure generally meaning “having circuitry or another physical structure that” performs, or is capable of performing, the task or tasks during operations. The circuitry may be dedicated circuitry, or more general processing circuitry operating under the control of coded instructions. That is, terms like “unit”, “module”, “circuit”, “logic”, and “component” may be used herein, in describing certain aspects or features of various implementations of the invention. It will be understood by persons skilled in the art that the corresponding features are implemented utilizing circuitry, whether it be dedicated circuitry or more general purpose circuitry operating under micro-coded instruction control.

Further, the unit/module/circuit/logic/component can be configured to perform the task even when the unit/module/circuit/logic/component is not currently in operation. Reciting a unit/module/circuit/logic/component that is configured to perform one or more tasks is expressly intended not to invoke 35 U.S.C. § 112(f) for that unit/module/circuit/logic/component. In this regard, persons skilled in the art will appreciate that the specific structure or interconnections of the circuit elements will typically be determined by a compiler of a design automation tool, such as a register transfer language (RTL) compiler. RTL compilers operate upon scripts that closely resemble assembly language code, to compile the script into a form that is used for the layout or fabrication of the ultimate circuitry.

That is, integrated circuits (such as those of the present invention) are designed using higher-level software tools to model the desired functional operation of a circuit. As is well known, “Electronic Design Automation” (or EDA) is a category of software tools for designing electronic systems, such as integrated circuits. EDA tools are also used for programming design functionality into field-programmable gate arrays (FPGAs). Hardware description languages (HDLs), like Verilog and very high-speed integrated circuit hardware description language (VHDL), are used to create high-level representations of a circuit, from which lower-level representations and ultimately actual wiring can be derived. Indeed, since a modern semiconductor chip can have billions of components, EDA tools are recognized as essential for their design. In practice, a circuit designer specifies operational functions using a programming language like C/C++. An EDA software tool converts that specified functionality into RTL. Then, a hardware description language (e.g., Verilog) converts the RTL into a discrete netlist of gates. This netlist defines the actual circuit that is produced by, for example, a foundry. Indeed, these tools are well known and understood for their role and use in the facilitation of the design process of electronic and digital systems, and therefore need not be described herein.

FIG. 1A shows an embodiment of an example pipeline for a microprocessor 10. It should be appreciated that certain known components of a microprocessor 10 are omitted here for brevity and ease of explanation and illustration. As is known, the pipeline architecture provides for multiple instructions that are overlapped in execution, with each stage referred to as a pipe stage. The blocks shown in the pipeline may each be implemented according to one or more stages, those stages shown to the left of the blocks and represented in the depicted embodiment by the upper-case letters C, I, B, U, L, M, F, G, W, X, E, S, W, Y, and Z that are sequentially advanced from top-down and as redirected (as shown by the arrows). It should be appreciated by one having ordinary skill in the art that the number and/or arrangement of stages depicted in FIG. 1A is merely illustrative of one example embodiment, and that in some embodiments, a different number and/or arrangement of stages may be implemented and hence contemplated to be within the scope of the disclosure. It should also be appreciated by one having ordinary skill in the art that the blocks provide a general description of functionality for the pipeline, and that associated logic or circuitry known to those having ordinary skill in the art is omitted here for brevity. For instance, it should be appreciated by one having ordinary skill in the art that each stage of the pipeline may be separated by clocked pipeline registers or latches, as is known.

In one embodiment, the microprocessor 10 comprises an I-cache tag array 12, an I-cache data array 14, a side cache tag array 16, and a side cache data array 18. The microprocessor 10 further comprises a length/prefix (L/PF) scan logic 20, expand logic 22, an instruction mux (M) queue 24, and an XIB mux (M) queue 26. In one embodiment, the I-cache tag array 12, I-cache data array 14, side cache tag array 16, side cache data array 18, L/PF scan logic 20, expand logic 22, instruction M queue 24, and XIB M queue 26 comprise the side cache array system, though in some embodiments, fewer or more logic components may make up the side cache array system. The microprocessor 10 further comprises an instruction formatter 28, a formatted instruction queue (FIQ)/loop queue 30, a translate logic 32, register alias table/reservation stations (RAT/RS) 34, execution units 36, and retire logic 38.

In one embodiment, the I-cache tag array 12 and the side cache tag array 16 are implemented at the C stage. Referring to FIG. 1B, shown are example sources used at a front end 40 of the pipeline shown for the microprocessor 10 of FIG. 1A. The front end 40 comprises a fetch unit 42 (e.g., including a mux and clocked register), the side cache tag array 16, a translation lookaside buffer (TLB) 44, the I-cache data array 14, the I-cache tag array 12, a branch target access cache (BTAC) 46 (e.g., part of the pipeline but not shown in FIG. 1A), a quick predictor 48 (e.g., part of the pipeline as well but not shown in FIG. 1A), and a mux 50. The fetch unit 42 receives plural sources of cache instruction addresses, including a sequenced instruction address, corrected instruction address (e.g., from the S stage), decode time instruction address (e.g., from the G stage), and addresses from the BTAC 46 and quick predictor 48. The output of the fetch unit 42 is a cache address that is provided as inputs to the side cache tag array 16, the TLB 44, I-cache data array 14, I-cache tag array 12, BTAC 46, and quick predictor 48 for accessing the next instruction of the I-cache data array 14.

Digressing briefly, the quick predictor 48 comprises a single cycle branch predictor that provides for single cycle prediction (e.g., takes one cycle to produce a target address, the prediction provided at the I stage in one embodiment). In one embodiment, the quick predictor 48 comprises a table (also referred to herein as array or target array) that stores branch target addresses of previously executed branch instructions, the table enabling a branch prediction when the stored branch instructions are subsequently encountered. In one embodiment, the table comprises 128 entries, though tables of other sizes (e.g., 64 entries, 32 entries, etc.) may be used in some embodiments. The table is organized as an n-way (e.g., n is an integer greater than one) set associative cache. Generally, an n-way set associative cache is also referred to herein as a multi-set associative cache. In one embodiment, each entry stores eight (8), 3-bit counters and the current local branch pattern, the counter chosen by a 3-bit local branch pattern. The quick predictor 48 further comprises a conditional branch predictor that is accessed in parallel with the table and that provides a taken/not taken direction for conditional branches. The quick predictor 48 further comprises a return stack that can provide a target instead of the table. In one embodiment, the return stack comprises four (4) entries and provides the target for return instructions. Note that the specifications listed above are merely for illustration, and that some embodiments may perform under different specifications and hence are contemplated to be within the scope of the invention. The quick predictor 48 is configured to deliver a predicted branch target immediately (within a single cycle) with no taken branch penalty. In some embodiments, the quick predictor 48 may operate according to other specifications for its prediction mechanism and/or table configuration, or in some embodiments, may be omitted. Most branches are correctly predicted by the quick predictor 48. In some embodiments, where the quick predictor 48 provides a branch prediction that differs (e.g., difference in direction and/or target) from the branch prediction of the BTAC 46 based on the same fetched branch instruction, the BTAC 46 overrides the branch prediction of the quick predictor 48 and updates the quick predictor table within the set of stages of the BTAC 46, for instance, at the U stage, with the branch prediction information (e.g., direction, target address, branch prediction type) provided by the BTAC 46.
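By way of illustration only, the selection of one of eight 3-bit counters by a 3-bit local branch pattern may be sketched in software as follows. The class name, the initial counter value, the taken threshold, and the saturating update rule are assumptions made for this sketch and are not specified by the present disclosure.

```python
class QuickPredictorEntry:
    """Hypothetical model of one table entry: eight 3-bit counters plus a
    3-bit local branch pattern that selects which counter is consulted."""

    COUNTER_MAX = 7          # largest value of a 3-bit counter
    TAKEN_THRESHOLD = 4      # assumption: upper half of the range means "taken"

    def __init__(self):
        self.counters = [3] * 8   # assumption: start weakly not-taken
        self.pattern = 0          # last three outcomes, newest in bit 0

    def predict(self) -> bool:
        """Predict taken when the pattern-selected counter is in the upper half."""
        return self.counters[self.pattern] >= self.TAKEN_THRESHOLD

    def update(self, taken: bool) -> None:
        """Saturating update of the selected counter, then shift in the outcome."""
        value = self.counters[self.pattern]
        if taken:
            self.counters[self.pattern] = min(self.COUNTER_MAX, value + 1)
        else:
            self.counters[self.pattern] = max(0, value - 1)
        self.pattern = ((self.pattern << 1) | int(taken)) & 0b111


entry = QuickPredictorEntry()
for _ in range(10):          # train on a branch that is always taken
    entry.update(True)
```

After this training loop the pattern has settled at binary 111 and the counter it selects has saturated, so the sketch predicts taken for the next encounter.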

The I stage and/or B stage correspond to access to the various tables of the pipeline, including in some embodiments muxing out the direction or way from the tables (e.g., based on the tags) and reading out of the instructions.

The BTAC 46 holds information about previously executed branch instructions that it uses to predict the target address, direction, and type during subsequent executions. The BTAC 46 comprises one or more tables that are much larger than the table of the quick predictor 48. In one embodiment, the BTAC 46 comprises a 4 k entry, m-way set-associative table (also referred to herein as array or target array), where m is an integer greater than one. Similar to n-way set-associative tables, m-way set-associative tables may also be referred to herein as multi-set associative tables. Each entry of the BTAC 46 comprises a valid bit, a branch target address prediction, a direction prediction, and a branch type. The branch type specifies whether the branch instruction is a call/return, indirect branch, conditional relative branch, or unconditional relative branch. In one embodiment, the BTAC 46 comprises or cooperates with a conditional relative branch predictor (or simply, conditional branch predictor) having a multiple entry (e.g., 12 k) tagged geometric (TAGE)-based conditional branch predictor, multiple tables, a multi-bit (e.g., 3 bit), taken/not taken (T/NT) counter, and multi-bit global branch history. That is, the TAGE conditional branch predictor comprises tagged tables with geometrically increasing branch history lengths, as is known. As another example, the indirect prediction comprises a multiple entry (e.g., 1.5 k) TAGE predictor and uses the table entries for static indirect branches. In one embodiment, two TAGE conditional branch predictors are used, one for side A and one for side B in a predictor array. The TAGE conditional branch predictor may be part of the BTAC or used in conjunction with the BTAC 46.

The TLB 44, under management by a memory management unit (not shown), provides for a virtual to physical page address translation as is known. That is, the TLB 44 stores the physical addresses of the most recently used virtual addresses. The TLB 44 receives a linear address from a segmentation unit (which converts the logical address from a program into the linear address), and a portion of the linear address is compared to the entries of the TLB 44 to find a match. If there is a match, the physical address is calculated from the TLB entry. If there is no match, a page table entry from memory is fetched and placed into the TLB 44.

The I-cache data array 14 comprises a level 1 cache of instructions that have been fetched or prefetched from L2, L3 or main memory. The I-cache data array 14 comprises multiple clocked registers.

The I-cache tag array 12 comprises an array of tags corresponding to the instructions in the I-cache data array 14, and comprises multiple clocked registers, and is used to determine a match between information associated with the fetched cache instruction (e.g., the tag or portion of the cache address) to the I-cache data array 14 and BTAC 46.

More relevant to the side cache array system, the I-cache tag array 12 and the side cache tag array 16 are implemented in some embodiments in parallel (e.g., at the C stage), along with the other processes including sending the address to the I-cache data array 14, TLB 44, quick predictor 48, and BTAC 46. Notably, the side cache tag array 16 is separate from the side cache data array 18, the latter implemented in a different stage (e.g., the U stage). The I-cache data array 14 is configured to provide 32 bytes of data, but for most processes, handles fetches in 16 bytes/cycle. A hit at the side cache tag array 16 signals to the mux 50 to select 32 bytes (instead of 16 bytes), and the sequential address is incremented 32 bytes instead of 16. A miss at the side cache tag array 16 signals to the mux 50 to increment the address by 16 bytes. In other words, the mux 50 is configured, based on whether there is a hit or not in the side cache tag array 16, to select either 32 bytes or 16 bytes, where the sequential address is incremented accordingly to the fetch unit 42.
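For purposes of illustration, the selection between a 16 byte and a 32 byte sequential increment may be modeled in software as sketched below. The names SideCacheTagArray, install, hit, and select_fetch are hypothetical, and indexing the tag model directly by fetch address is a simplifying assumption; the actual tag comparison is a hardware lookup whose indexing is not modeled here.

```python
FETCH_BYTES = 16        # normal fetch width per cycle
WIDE_FETCH_BYTES = 32   # width selected on a side cache tag array hit


class SideCacheTagArray:
    """Toy stand-in for the tag array: a set of fetch addresses that are
    assumed to have a valid side cache entry."""

    def __init__(self):
        self._tags = set()

    def install(self, fetch_addr: int) -> None:
        self._tags.add(fetch_addr)

    def hit(self, fetch_addr: int) -> bool:
        return fetch_addr in self._tags


def select_fetch(fetch_addr: int, tags: SideCacheTagArray):
    """Return (bytes selected, next sequential address) for one fetch."""
    width = WIDE_FETCH_BYTES if tags.hit(fetch_addr) else FETCH_BYTES
    return width, fetch_addr + width


tags = SideCacheTagArray()
tags.install(0x1000)
wide = select_fetch(0x1000, tags)     # hit: 32 bytes, next address 0x1020
narrow = select_fetch(0x1020, tags)   # miss: 16 bytes, next address 0x1030
```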

Referring again to FIG. 1A, before proceeding with additional description of the side cache array system, the logic for handling the first 16 bytes (followed by the balance of the pipeline) is briefly described. For the first 16 bytes of data from the 32 byte fetch, the L/PF scan logic 20, XIB M queue 26, and the instruction formatter 28 provide XIB and decode functionality associated with stages L (length), M (mux), and F (format) of the pipeline. The L/PF scan logic 20 determines and marks the beginning and ending byte of each instruction (L stage) within the stream and thereby breaks up the stream of bytes into a stream of x86 instructions, which is staged at the XIB M queue 26 (M-stage) before being provided to decoding functionality at the F stage of the instruction formatter 28. Note that additional information on XIB functionality and the L, M, and F stages may be found in U.S. Pat. No. 8,612,727, incorporated herein by reference in its entirety to the extent consistent with the current disclosure.

The FIQ/loop queue 30 receives the formatted instructions and buffers them until they can be translated into microinstructions. The FIQ/loop queue 30 also provides for a preliminary decoding and fast looping function (e.g., on a BTAC loop branch, the loop queue is activated and loop instructions are repeatedly sent).

The W stage provides for an optional extra timing clock.

At the X stage, the instruction translator 32 translates (in the X or translate stage) the formatted instructions stored in the FIQ/loop queue 30 into microinstructions.

The instructions are provided in program order to register alias table/reservation station (RAT/RS) tables 34. The RAT functionality of the RAT/RS 34 maintains and generates dependency information for each instruction. The RAT functionality of the RAT/RS 34 renames the sources and destinations of the instructions onto internal registers, and dispatches the instructions to reservation stations of the RAT/RS 34, which issue the instructions, potentially out of program order, to functional units, or execution units (EUs) 36. The execution units 36, which include integer units, execute branch instructions at stage E (execution). Execution units, branch units, and integer units are terms that are used interchangeably herein. In one embodiment, the execution units 36 (e.g., two execution units) execute two branches in a single clock cycle. The execution units 36 also indicate whether the BTAC 46 has correctly predicted the branch instruction.

Results of the execution are provided to the retire logic 38. In one embodiment, the retire logic 38 comprises a reorder buffer (not shown), which comprises information pertaining to instructions that have been executed. As is known, the reorder buffer keeps the original program order of instructions after instruction issue and allows result serialization during a retire stage. In one embodiment, some of the information of the reorder buffer may be stored elsewhere along the pipeline, such as at the instruction formatter 28. Information stored in the reorder buffer may include branch information, such as type of branch, branch patterns, targets, the tables used in the prediction, and cache replacement policy information (e.g., least recently used or LRU). The retire logic 38 may further comprise a branch table update, which comprises stages S, W, Y, and Z, and is configured to update (e.g., at the S stage) the various tables at the front end (e.g., BTAC) with information about the fully decoded and executed branch instruction (e.g., the final result of the branch). The update may involve, at stages S, W, Y, and Z, a table read, a target address write, and a counter increment or decrement, which may involve some delays. In one embodiment, the branch table update provides an indication of a misprediction for a given conditional branch instruction and the side (e.g., A, B, or C) in which the conditional branch instruction is cached.

Referring again to relevant functionality for the side cache array system, in one embodiment, the side cache data array 18 comprises 2-way, 64 entry tables or arrays, each entry comprising 2 KB of instruction data. The side cache data array 18 stores instruction boundaries (e.g., start, end), accumulated prefixes, branch information (e.g., where the BTAC branches are in the fetch), and breakpoint marks. The instruction data stored in the side cache data array 18 is stored in compressed form by storing markers for the second half 16 bytes in a manner that is approximately half the size it otherwise would be (e.g., if stored in the format of the XIB M queue 26). Responsive to a hit in the side cache tag array 16, instruction information or data (e.g., instruction boundaries, prefix information, etc.) associated with the latter half of the 32 byte fetch is staged to the side cache data array 18, processed by the expand logic 22, and written to the XIB M queue 26. The data for the first 16 bytes of the 32 byte fetch is handled by the L/PF scan logic 20 after which it is written to the XIB M queue 26, and the raw data from the I-cache data array 14 (e.g., that is not stored in the side cache data array 18) is staged to the instruction M queue 24. Note that processing of the I-cache data (e.g., for the first 16 bytes) is performed along the non-side cache branch (on the left side in FIG. 1A) in standard fashion. The L/PF scan logic 20 determines instruction lengths and accumulates prefixes, which may amount to 10-15 different types in x86 based instructions. For instance, based on a scan of the instruction, prefixes are identified (e.g., at hex 66, 67, 2E, 3E, etc.), and may include OS, AS, REX presence and their variations, etc., as described in U.S. Pat. No. 8,612,727, incorporated herein by reference in its entirety. The L/PF scan logic 20, in conjunction with the XIB M queue 26, accumulates one or more of these prefixes and attaches or associates them with the opcode byte. Accordingly, the scan enables a determination of instruction length (e.g., starting at the opcode byte) and all of the prefixes affecting that instruction. The L/PF scan logic 20 uses standard or normal L-stage logic, as explained in the above-referenced patent incorporated by reference, to process the first half or portion of the 32 byte fetch (i.e., the first 16 bytes). That is, the L/PF scan logic 20 scans the information from the I-cache data array 14, produces the appropriate markers, and writes the information to another entry of the XIB M queue 26. In other words, the L/PF scan logic 20 and the expand logic 22 write respective entries to the XIB M queue 26.

Side cache entries are written, based on a prior scan, according to certain conditions (e.g., side cache miss, odd 16 byte address signifying the second half of a 32 byte fetch, and not being the target of a branch). In general, since the prefix information can add considerably to the size of each instruction (e.g., 15 bits per byte), as can the branch information (e.g., whether there is a branch, whether it is taken or not taken), the total number of possible bits may be 20 bits×16 bytes. Though some embodiments may store all of those bits in the side cache data array 18, in some embodiments, only a fraction of the information is stored. The side cache data array 18 stores a compressed, per instruction version of this instruction information, and also limits the number of instructions where the side cache data array 18 is utilized (e.g., 5 or fewer instructions, which in some embodiments is programmable). In other words, one purpose of the side cache array system is to handle long-length instructions where the typical 16-byte fetch bandwidth is not sufficient to handle a group of these types of instructions. To preserve the side cache entries for circumstances where needed (e.g., for 8-10 byte long instructions extracted from the instruction cache data array 14), the side cache data array 18 is configured to store a limited amount of instruction data with enough bit capacity in each entry to represent via bit representations the various marks (start, end, prefixes) per instruction byte. That is, the compressed format enables the storage of only five sets of these bits instead of sixteen sets of 15 bits, as described further below in association with FIG. 2. That is, the side cache data array 18 need only store five possible start bits, five possible end bits, five possible break point markers, and five possible prefix markers in one embodiment. Instead of having 240 bits for all of the prefixes, there are only 75 bits. Accordingly, certain embodiments of a side cache array system are intended to handle a predetermined or programmable quantity of instructions, and for the sake of illustration, a quantity of five instructions will be used as the maximum limit to handle in the side cache data array 18, with the understanding that other limits may be used in some embodiments. In general, the side cache data array 18 stores markers (e.g., bit representations) for start and end for an instruction, breakpoints, branch information, and prefix information. Notably, the side cache data array 18, with storage of markers as opposed to I-cache data, is smaller than an instruction cache with prefix information and various markers embedded therein.
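The prefix storage arithmetic above (sixteen sets of 15 bits versus five sets of 15 bits) and the notion of a per instruction record may be sketched as follows. The record layout (start offset, end offset, prefix mask), the names ByteMarks, InstrRecord, and compress, and the choice to skip bytes preceding the first start mark are assumptions made for illustration; the actual stored format is that described in association with FIG. 2.

```python
from typing import List, NamedTuple, Optional

PREFIX_BITS = 15           # prefix bits per set, per the description
BYTES_PER_HALF_FETCH = 16  # second half of a 32 byte fetch
MAX_INSTRUCTIONS = 5       # illustrative limit used in the description


class ByteMarks(NamedTuple):
    """Per-byte marks as a length/prefix scan might produce them (hypothetical)."""
    start: bool
    end: bool
    prefix_mask: int = 0


class InstrRecord(NamedTuple):
    """Hypothetical compressed, per instruction record."""
    start_offset: int
    end_offset: int
    prefix_mask: int


def prefix_storage_bits(sets: int) -> int:
    """Prefix bits needed when `sets` sets of prefix marks are stored."""
    return PREFIX_BITS * sets


def compress(marks: List[ByteMarks]) -> Optional[List[InstrRecord]]:
    """Collapse per-byte marks into per instruction records.

    Returns None when more than MAX_INSTRUCTIONS complete instructions are
    found, modeling a region that is not stored in the side cache.
    """
    records = []
    start = None
    prefix = 0
    for offset, mark in enumerate(marks):
        if mark.start:
            start, prefix = offset, 0
        if start is not None:
            prefix |= mark.prefix_mask      # accumulate prefixes per instruction
            if mark.end:
                records.append(InstrRecord(start, offset, prefix))
                start = None
    return records if len(records) <= MAX_INSTRUCTIONS else None


# Two 8-byte instructions; the first carries one (arbitrary) prefix bit.
long_marks = [ByteMarks(i % 8 == 0, i % 8 == 7, 0x1 if i == 1 else 0)
              for i in range(BYTES_PER_HALF_FETCH)]
long_records = compress(long_marks)

# Sixteen 1-byte instructions exceed the limit and are not stored.
short_marks = [ByteMarks(True, True) for _ in range(BYTES_PER_HALF_FETCH)]
short_records = compress(short_marks)
```

Under these assumptions, prefix_storage_bits(16) gives 240 and prefix_storage_bits(5) gives 75, matching the figures stated above.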

The compressed instruction information from the side cache data array 18 is then expanded by expand logic 22 to a format suitable for use by the XIB M queue 26. For instance, the expand logic 22, before writing to the XIB M queue 26, knows to attach start and/or end bits and other instruction information for each of the instruction bytes. In effect, the output of the expand logic 22 comprises the result of a length scan (mark every byte with a start or end byte), markings indicating whether there is a BTAC branch on it, whether there is a breakpoint, and an identification of one or more prefixes associated with the instruction byte. For instance, if the first instruction starts at (hex shorthand) byte 2, the prefix data is attached, and then on to the next instruction to determine whether certain bits need to be attached and so on. The result is an output to the XIB M queue 26 according to one entry for this expanded information for the second half or portion of the 32 byte fetch.
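As a non-limiting software analogy of the expansion just described, the sketch below turns a short list of per instruction records into sixteen per-byte mark sets. The record fields, the names InstrRecord, ByteMarks, and expand, and the choice to attach prefix, branch, and breakpoint marks at the start byte are assumptions for illustration and do not represent the circuit of FIG. 2.

```python
from typing import List, NamedTuple


class InstrRecord(NamedTuple):
    """Hypothetical compressed record held for one instruction."""
    start_offset: int
    end_offset: int
    prefix_mask: int = 0
    has_branch: bool = False
    has_breakpoint: bool = False


class ByteMarks(NamedTuple):
    """Hypothetical expanded marks for one instruction byte."""
    start: bool = False
    end: bool = False
    branch: bool = False
    brk: bool = False
    prefix_mask: int = 0


def expand(records: List[InstrRecord], width: int = 16) -> List[ByteMarks]:
    """Expand per instruction records into one mark set per byte."""
    marks = [ByteMarks()] * width
    for rec in records:
        # Assumption: prefix, branch, and breakpoint marks ride on the start byte.
        marks[rec.start_offset] = marks[rec.start_offset]._replace(
            start=True,
            prefix_mask=rec.prefix_mask,
            branch=rec.has_branch,
            brk=rec.has_breakpoint,
        )
        # Applied second so a one-byte instruction carries both start and end.
        marks[rec.end_offset] = marks[rec.end_offset]._replace(end=True)
    return marks


# One instruction starting at byte 2 and ending at byte 9, with one prefix bit.
expanded = expand([InstrRecord(2, 9, prefix_mask=0b1)])
single = expand([InstrRecord(5, 5)])
```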

The instruction M queue 24 tracks along with the XIB M queue 26, and in one embodiment, comprises a part of the XIB M queue 26. The instruction M queue 24 receives the raw, unmodified data from the instruction cache data array 14. The instruction M queue 24 contains the instruction bytes for staging down to decoders of the instruction formatter 28. For the 16 byte fetch scenario, there is a single write entry to the instruction M queue 24. For the 32 byte fetch scenario, there are two entries written to the instruction M queue 24.

In the XIB M queue 26, each byte has associated with it the expanded marks, including 19 bits for each of the 16 bytes per entry, corresponding to start, end, whether it is a branch, a data breakpoint, and prefix type or types (e.g., OS, AS, 2E, 3E, segment override prefixes, etc.). For instance, 15 bits correspond to prefixes, and 4 bits to start, end, branch, and breakpoint. The XIB M queue 26 further comprises approximately 6-12 entries in some embodiments. The XIB M queue 26 is read to feed the M stage where instructions are actually muxed and whole stages consumed for formatting at the F stage.
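
The per-byte marks described above can be pictured as a 19 bit word. The following is a minimal sketch only: the bit positions, the constant names, and the function `pack_byte_marks` are assumptions made for illustration and are not taken from the disclosure, which states only that 4 bits cover start, end, branch, and breakpoint and 15 bits cover prefixes.

```python
# Hypothetical bit layout for the marks attached to one instruction byte.
# The ordering of the fields is an assumption; only the widths (4 + 15)
# come from the description.
START_BIT = 0
END_BIT = 1
BRANCH_BIT = 2
BPOINT_BIT = 3
PREFIX_SHIFT = 4
PREFIX_WIDTH = 15
BITS_PER_BYTE = PREFIX_SHIFT + PREFIX_WIDTH  # 4 mark bits + 15 prefix bits


def pack_byte_marks(start, end, branch, bpoint, prefixes):
    """Pack the marks for a single byte into one integer of BITS_PER_BYTE bits."""
    if not 0 <= prefixes < (1 << PREFIX_WIDTH):
        raise ValueError("prefixes must fit in 15 bits")
    return (
        (int(bool(start)) << START_BIT)
        | (int(bool(end)) << END_BIT)
        | (int(bool(branch)) << BRANCH_BIT)
        | (int(bool(bpoint)) << BPOINT_BIT)
        | (prefixes << PREFIX_SHIFT)
    )


# Example: a byte that starts an instruction, carries a breakpoint mark,
# and has the lowest (hypothetical) prefix bit set.
marks = pack_byte_marks(True, False, False, True, 0b1)
```

Sixteen such words, one per byte, would make up one entry of the queue under these assumptions.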

Control logic for certain embodiments of a side cache array system provides for certain checks on updates to the side cache array system. For instance, writes to a new side cache entry are implemented when there is a side cache tag array miss, the fetch involves an odd 16B address (e.g., signifying the second half of a 32 byte fetch) and is not a target of a branch (e.g., exclude the target of a branch because, in that case, all of the start/end/prefix markers will not be available since the entire fetch is not scanned, just a portion after the branch target). For instance, when branching to a point in the middle of a 16 byte fetch, a full scan for instruction boundaries will not occur, and as such, a full set of start, end, and prefix data will not be available to write into the side cache data array 18. Accordingly, where the side cache entry would involve the target of a branch, such is excluded from a side cache entry. Additionally, certain embodiments of a side cache array system may limit (e.g., through the use of a feature control register, scan, fuse, etc.) the use of the side cache data array 18 to regions of code of a predetermined or programmable quantity of instructions per 16 bytes. For instance, where there are more than five instructions per 16 bytes, to avoid exceeding byte fetch per clock bandwidths (e.g., 16 bytes per clock cycle fetch), the side cache data array entries may be limited to some predetermined or configurable number of instructions (e.g., 3-5). Also, a side cache data array entry, or in some embodiments, the entire side cache data array 18, is invalidated on an I-cache data array cast out (e.g., where data is evicted from the I-cache data array 14, there needs to be a corresponding entry invalidated in the side cache data array 18, so as to avoid improper aliases), snoop invalidate (e.g., via a signal sent from the I-cache data array 14), TLB invalidate, or an OS/AS prefix default change (e.g., which affects instruction length). It is noted that since the side cache array system works in parallel with the instruction cache, it is known by the U stage both when there is a cache hit and if there is a cast out or invalidating event.
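
The write conditions described above can be summarized as a single predicate. This is a sketch under stated assumptions, not the control logic of any embodiment: the function name, its parameters, and the reading of "odd 16B address" as bit 4 of the fetch address being set are all hypothetical.

```python
def should_write_side_cache_entry(tag_hit, fetch_addr, is_branch_target,
                                  num_instructions, max_instructions=5):
    """Return True when a new side cache entry may be written.

    Assumptions (not from the disclosure): fetch_addr is a byte address,
    and an "odd 16B address" means bit 4 of that address is set.
    """
    odd_16b = ((fetch_addr >> 4) & 1) == 1
    return (
        not tag_hit                      # only on a side cache tag array miss
        and odd_16b                      # second half of a 32 byte fetch
        and not is_branch_target         # partial scans lack full marker data
        and num_instructions <= max_instructions  # configurable limit
    )


# Example: a tag miss on the upper 16 bytes, not a branch target,
# with four instructions in the 16 byte block.
ok = should_write_side_cache_entry(False, 0x1010, False, 4)
```

The invalidation events listed above (cast out, snoop invalidate, TLB invalidate, OS/AS prefix default change) are deliberately left out of this sketch, since they clear entries rather than gate writes.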

Referring now to FIG. 2, shown is a schematic illustration of side cache entry expansion, such as performed by the expand logic 22. In particular, shown is the side cache data array 18, illustrations of prefix buses 51 corresponding to five instructions with start and prefix information, demultiplexers 52 for the instructions, the XIB M queue 26, and decode OR logic 54. Note that the use of five instructions as a condition for implementing an embodiment of a side cache array system is used for illustration, and that other limits may be imposed in some embodiments. The data stored in the side cache data array 18 is in a format shown representatively on the left hand side beneath the side cache data array 18 in FIG. 2, and comprises end bit markers for bytes 15 through 0 (end[15:0]), branch markings for bytes 15 through 0 (branch[15:0]), and breakpoint markings for bytes 15 through 0 (bpoint[15:0]). These markings are in a form used by the XIB M queue 26, and hence are also stored directly to the XIB M queue entry 56 as shown. The expand logic 22 decodes and performs a logical OR operation to expand encoded start points for each of the five instructions into the XIB M queue format, as shown by the decode OR logic 54 and Ins1start[3:0], Ins2start[3:0], . . . Ins5start[3:0]. For instance, the start bits for the 5 instructions are decoded from 4 bit versions into 16 bit versions. These 16 bit versions are ORed together to create the Start[15:0] format in the XIB M queue entry. Note that keeping the start bits per-instruction in this format, in the side cache data array 18, does not save space, but it enables an expansion of the prefix buses into a format suitable for the XIB M queue 26.
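
The decode and OR step for the start fields can be illustrated in software. The sketch below is an analogy for the hardware behavior, not an implementation of it; the function name `expand_starts` is hypothetical, and passing only the instructions actually present in the entry is an assumption, since the description does not say how unused instruction slots are represented.

```python
def expand_starts(ins_starts):
    """Decode each 4 bit start position into a 16 bit one-hot value and
    OR the results together to form Start[15:0]."""
    start_mask = 0
    for start in ins_starts:
        if not 0 <= start <= 15:
            raise ValueError("start position must fit in 4 bits")
        start_mask |= 1 << start  # 4-to-16 decode, then OR
    return start_mask


# Example: three instructions starting at bytes 2, 5, and 12.
start_bits = expand_starts([2, 5, 12])
```

Bit n of the result being set indicates that an instruction starts at byte n of the 16 byte portion.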

Focusing attention on the center of FIG. 2, from the side cache data array 18 are data for five instructions each comprising a 4 bit start [3:0] and a 15 bit prefix [14:0], and what is being illustrated is a steering of the prefix buses 51 using demultiplexers 52 to create a format suitable for use in the XIB M queue 26 (e.g., prefixes attached to each instruction, including Ins1Prefix[14:0], Ins2Prefix[14:0], . . . Ins5Prefix[14:0]). For instruction 1, the start1 field has a value between 0 and 15. Based on this value, the Ins1 prefixes (PF1) are steered to one of the 16 sections of the XIB M queue entry 56 being written. For instance, if Start1[3:0]==1100, then PF1[14:0] is steered to the byte 12 section of the XIB M queue entry 56 being written. The same is done for each of the five instructions. The data from each of the instructions is ORed together to create the M queue entry write data. That is, each demultiplexer (demux) 52 feeds the inputs to the XIB M queue 26 (e.g., the 16 sections of the XIB M queue 26 input that the Ins1 demux feeds are also fed by demuxes for Ins2, Ins3, etc., which are ORed together (e.g., in RTL)).
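
The prefix steering can likewise be sketched in software. This is an illustrative model only: the function name `steer_prefixes` and the representation of an entry as a list of (start, prefix) pairs are assumptions, and the prefix values used in the example are arbitrary bit patterns rather than real prefix encodings.

```python
def steer_prefixes(instructions):
    """Steer each instruction's 15 bit prefix field to the section of the
    entry selected by its 4 bit start value, ORing the demux outputs.

    `instructions` is a list of (start, prefix) pairs, a hypothetical
    representation of the per-instruction fields of a side cache entry.
    """
    sections = [0] * 16  # one prefix section per byte of the entry
    for start, prefix in instructions:
        if not 0 <= start <= 15:
            raise ValueError("start must fit in 4 bits")
        if not 0 <= prefix < (1 << 15):
            raise ValueError("prefix must fit in 15 bits")
        sections[start] |= prefix  # demux by start, OR into the write data
    return sections


# Example from the text: Start1[3:0] == 1100 steers PF1 to the byte 12 section.
entry_prefixes = steer_prefixes([(0b1100, 0x0005), (0b0010, 0x4000)])
```

Because each instruction in a 16 byte portion starts at a distinct byte, the OR never merges prefix fields from two different instructions in this model.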

Note that only a single entry of the XIB M queue 26 is shown. In a practical implementation, there may be six (6) or more entries in this queue. Each byte needs 1 bit each stored for start, end, branch, and breakpoint plus 15 bits for prefixes. In one embodiment, each entry is 16×19, or 304, bits wide and may also include some other data that is inconsequential to this description.
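
The entry width arithmetic can be checked with a few lines of code. The expanded width follows directly from the figures given above; the compressed width is an estimate that assumes a side cache entry holds only the fields named in the description of FIG. 2 (end, branch, and breakpoint marks for 16 bytes plus a 4 bit start and 15 bit prefix for each of five instructions) and no additional valid or control bits. All constant and function names are hypothetical.

```python
BYTES_PER_ENTRY = 16
MARK_BITS = 4          # start, end, branch, breakpoint
PREFIX_BITS = 15
MAX_INSTRUCTIONS = 5   # illustrative limit used in the description
START_FIELD_BITS = 4


def expanded_entry_bits():
    """Width of one XIB M queue entry: 19 bits for each of 16 bytes."""
    return BYTES_PER_ENTRY * (MARK_BITS + PREFIX_BITS)


def compressed_entry_bits():
    """Estimated width of one side cache data array entry, assuming only
    end/branch/bpoint[15:0] and five (start, prefix) pairs are stored."""
    per_byte_marks = 3 * BYTES_PER_ENTRY
    per_instruction = MAX_INSTRUCTIONS * (START_FIELD_BITS + PREFIX_BITS)
    return per_byte_marks + per_instruction


expanded = expanded_entry_bits()
compressed = compressed_entry_bits()
```

Under these assumptions the compressed entry is less than half the width of the expanded entry, with the saving coming from storing prefixes per instruction rather than per byte.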

In view of the above description, it should be appreciated by one having ordinary skill in the art that a side cache array method, denoted method 58 in FIG. 3 and implemented in one embodiment by the microprocessor, comprises: receiving at an instruction cache an instruction fetch comprising a first byte portion and a second byte portion (60); signaling further processing by a side cache tag array of the second byte portion in addition to the first byte portion based on a hit of the side cache tag array (62); and storing at a side cache data array instruction data for the second byte portion (64).

Any process descriptions or blocks in flow diagrams should be understood as representing modules, segments, logic, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the embodiments in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in different order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present disclosure.

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments. Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims.

Note that various combinations of the disclosed embodiments may be used, and hence reference to an embodiment or one embodiment is not meant to exclude features from that embodiment from use with features from other embodiments. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality.

The invention claimed is:
 1. A microprocessor, comprising: an instruction cache configured to receive an instruction fetch comprising a first byte portion and a second byte portion, wherein the first byte portion and the second byte portion are different parts of the instruction fetch; a side cache tag array configured to signal processing of the second byte portion in addition to the first byte portion based on a hit of the side cache tag array; and a side cache data array configured to store instruction data for the second byte portion, wherein the instruction data stored in the side cache array comprises an indication of instruction boundaries, an indication of accumulated prefixes, an indication of a branch, and an indication of breakpoint marks.
 2. The microprocessor of claim 1, wherein the side cache data array is configured to store the instruction data in compressed form.
 3. The microprocessor of claim 2, further comprising expand logic configured to expand the instruction data from compressed form to expanded form.
 4. The microprocessor of claim 3, wherein the expanded form comprises the instruction data in a format suitable for storage in an XIB mux queue.
 5. The microprocessor of claim 4, further comprising the XIB mux queue, the XIB mux queue configured to receive the expanded instruction data corresponding to the second byte portion from the expand logic.
 6. The microprocessor of claim 1, further comprising processing the instruction data of the side cache data array, wherein the side cache data array is located at a later stage in a pipeline of the microprocessor than the side cache tag array.
 7. The microprocessor of claim 1, further comprising length and prefix scan logic configured to process instruction cache data corresponding to the first byte portion by performing a length determination and prefix scanning of the instruction cache data corresponding to the first byte portion, the instruction cache data comprising more bits of information than the side cache data.
 8. The microprocessor of claim 7, further comprising an XIB mux queue configured to receive the processed instruction cache data.
 9. The microprocessor of claim 1, further comprising a mux queue configured to receive raw instruction data from the instruction cache corresponding to the first byte portion and the second byte portion.
 10. The microprocessor of claim 1, wherein the first byte portion and the second byte portion each comprises sixteen (16) bytes.
 11. The microprocessor of claim 1, wherein the instruction fetch comprising the first byte portion and the second byte portion corresponds to instructions for multimedia processing.
 12. A method implemented by a microprocessor, the method comprising: receiving at an instruction cache an instruction fetch comprising a first byte portion and a second byte portion, wherein the first byte portion and the second byte portion are different parts of the instruction fetch; signaling processing of the second byte portion, in addition to processing of the first byte portion, based on a hit of a side cache tag array; and storing at a side cache data array instruction data for the second byte portion, wherein the instruction data stored in the side cache array comprises an indication of instruction boundaries, an indication of accumulated prefixes, an indication of a branch, and an indication of breakpoint marks.
 13. The method of claim 12, further comprising storing the instruction data in the side cache data array in compressed form.
 14. The method of claim 13, further comprising expanding the instruction data from compressed form to expanded form.
 15. The method of claim 14, wherein the expanding comprises formatting the instruction data in a format suitable for storage in an XIB mux queue, further comprising receiving at the XIB mux queue the expanded instruction data corresponding to the second byte portion.
 16. The method of claim 12, further comprising processing the instruction data of side cache data array after the hit at the side cache tag array.
 17. The method of claim 12, further comprising determining a length of, and an identification of a prefix of, the instruction cache data based on scanning the instruction cache data corresponding to the first byte portion, the instruction cache data comprising more bits of information than the side cache data.
 18. The method of claim 17, further comprising: receiving at an XIB mux queue the processed instruction cache data; and receiving at a mux queue raw instruction data from the instruction cache corresponding to the first byte portion and the second byte portion.